Significance analysis of microarrays

ABSTRACT

Microarrays can measure the expression of thousands of genes and thus identify changes in expression between different biological states. Methods are needed to determine the significance of these changes, while accounting for the enormous number of genes. We describe a new method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene based on the change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of such genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared to FDRs of 60% and 84% using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation, and 3 in apoptosis. Surprisingly, 4 nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a heretofore unrecognized role in repairing DNA damaged by ionizing radiation.

CROSS REFERENCE TO RELATED APPLICATION

[0001] This is a continuation-in-part of U.S. patent application Ser. No. 60/208,073, filed May 4, 2000, which is hereby incorporated by reference in its entirety for all purposes.

BACKGROUND OF THE INVENTION

[0002] This invention relates in general to statistical analysis of gene-related data and, in particular, to analysis of microarray data for identifying genes that exhibit statistically significant behavior.

[0003] Different biological systems are characterized by differences in the copy number of genes or in levels of transcription of particular genes. By measuring such biological phenomena, insight into and possible treatment of human diseases may be found.

[0004] Microarrays of various types have been employed for measuring the expression levels of large numbers of genes. One type of microarray is the oligonucleotide microarray, one example of which is the GeneChip® microarray manufactured by Affymetrix corporation of California. International Patent Application PCT/US96/14389, which is incorporated herein in its entirety, describes a method for measuring gene expression levels using oligonucleotide microarrays. In the method described, a nucleic acid sample is hybridized to a high density array of oligonucleotide probes immobilized to a surface, where the high density array contains oligonucleotide-type probes complementary to sequences of the target nucleic acids in the nucleic acid sample. For example, RNA transcripts of one or more target genes may be hybridized to an array of oligonucleotide probes immobilized on a surface such as that of a semiconductor chip. Some of the probes on the surface have sequences that are perfectly complementary to particular target sequences and are referred to herein as perfect match (PM) probes. Also present on the chip are probes whose sequence is deliberately selected not to be perfectly complementary to a target sequence. Such probes are referred to as mismatched (MM) control probes, where for each PM probe, there is a MM control probe for the same particular target sequence. This mismatch may comprise one or more bases. Thus, a biological sample such as an mRNA sample can be analyzed for gene expression by hybridization to the above-described microarray on a chip. The presence of RNA sequences that bind to the oligonucleotide probes on the chip is then detected by methods such as tagging with a fluorescent material and then detecting the fluorescence. Since sequences that are different from the target sequences may also bind to the PM probes that correspond to such target sequences, the fluorescence signals from such sequences would appear as noise. Signal-to-noise ratio is improved by calculating the difference between the signals from the sequences that bind to the PM probes and the signals from sequences that bind to the MM probes.

[0005] Another type of microarray that has been used for analyzing gene expression utilizes cDNA probes. Although massive amounts of data are generated using oligonucleotide or cDNA probes, quantitative methods are needed to determine whether differences in gene expression are experimentally significant. Previous work on microarrays has utilized cluster analysis to find coherent expression patterns among genes or in cells. See, for example, the following three articles:

[0006] 1. Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwal, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., Marti, G., Moore, T., J, H., Lu, L., Lewis, D., Tibshirani, R., Sherlock, G., Chan, W., Greiner, T., Weisenburger, D., Armitage, K., Levy, R., Wilson, W., Greve, M., Byrd, J., Botstein, D., Brown, P. & Staudt, L. (2000) Nature 403, 503-511.

[0007] 2. Eisen, M., Spellman, P., Brown, P. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868.

[0008] 3. Weinstein, J., Myers, T., O'Connor, P., Friend, S., Fornace, A., Kohn, K., Fojo, T., Bates, S., Rubinstein, L., Anderson, N., Buolamwini, J., van Osdol, W., Monks, A., Scudiero, D., Sausville, E., Zaharevitz, D., Bunow, B., Viswanadhan, V., Johnson, G., Wittes, R. & Paull, K. (1997) Science 275, 343-349.

[0009] Cluster analysis works best for a large number of samples. Moreover, cluster analysis provides little information about statistical significance. To answer biologically important questions, a method is needed which can analyze a relatively small number of samples and provide a measure of statistical certainty. Methods based on conventional t-tests provide the probability (p) that a difference in gene expression occurred by chance. See, for example, the following articles:

[0010] 4. Roberts, C., Nelson, B., Marton, M., Stoughton, R., Meyer, M., Bennett, H., He, Y., Dai, H., Walker, W., Hughes, T., Tyers, M., Boone, C. & Friend, S. (2000) Science 287, 873-880.

[0011] 5. Galitski, T., Saldanha, A., Styles, C., Lander, E. & Fink, G. (1999) Science 285, 251-254.

[0012] In conventional t-tests, p=0.01 may be significant in the context of experiments designed to evaluate small numbers of genes. However, at the same threshold, a microarray experiment for 10,000 genes would identify 100 genes by chance.
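The arithmetic behind this observation can be sketched as follows. This is only an illustration: the function name is hypothetical, and the calculation assumes that none of the genes truly changes, so that every call at the threshold is a chance call.

```python
def expected_chance_calls(n_genes, p_threshold):
    """Expected number of genes passing a per-gene p-value threshold by
    chance alone, assuming (for illustration) that no gene truly changes."""
    return n_genes * p_threshold


# The example in the text: 10,000 genes tested at p = 0.01.
chance_calls = expected_chance_calls(10_000, 0.01)
```

Under these assumptions the expected count scales linearly with the number of genes on the array, which is why a per-gene threshold suited to a handful of genes becomes too permissive for a whole microarray.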

[0013] One approach for ascertaining the statistical significance of microarray data is known as the “fold change” method. In this approach, if one were interested in measuring the effects of radiation on gene expression, a number of biological samples are subjected to radiation, and their gene expression is then measured. Other biological samples are measured without being subjected to radiation. The “fold change” method identifies genes as having been changed significantly by the radiation if the ratio of the average gene expression measured after being subjected to the radiation to the gene expression measured without being subjected to radiation is greater than a certain threshold or less than another threshold. As further explained below, the “fold change” method, in some instances, yields unacceptably high false discovery rates.
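A minimal sketch of the “fold change” rule described above is given here for illustration. The function name, the dictionary layout, and the default thresholds of 2.0 and 0.5 are assumptions and are not taken from the text; genes with a non-positive average are skipped because a ratio is not meaningful for them.

```python
def fold_change_calls(treated, control, upper=2.0, lower=0.5):
    """Return genes whose ratio of average expression (treated / control)
    is above `upper` or below `lower`.

    treated, control: dict mapping gene name -> list of measurements.
    The thresholds are illustrative assumptions.
    """
    called = []
    for gene, treated_values in treated.items():
        mean_treated = sum(treated_values) / len(treated_values)
        mean_control = sum(control[gene]) / len(control[gene])
        if mean_treated <= 0 or mean_control <= 0:
            continue  # ratio undefined or meaningless for non-positive averages
        ratio = mean_treated / mean_control
        if ratio > upper or ratio < lower:
            called.append(gene)
    return called


# Hypothetical data: geneA is induced 3-fold, geneB is unchanged,
# geneC is repressed 4-fold.
treated = {"geneA": [30.0, 30.0], "geneB": [10.0, 12.0], "geneC": [5.0, 5.0]}
control = {"geneA": [10.0, 10.0], "geneB": [11.0, 11.0], "geneC": [20.0, 20.0]}
calls = fold_change_calls(treated, control)
```

Note that the rule uses only the averages and ignores the spread of the repeated measurements, which is the weakness that the relative difference statistic described later is meant to address.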

[0014] In one attempt to improve on the “fold change” method, genes are identified to be significantly changed if a certain fold change is observed consistently between paired samples. While this yields a moderate improvement over the “fold change” method, this improved “pairwise fold change” method still yields a rather high false discovery rate.

[0015] As also noted above, conventional techniques analyze differences in gene expression levels, such as PM-MM, so that negative expression values are possible during analysis. Conventional methods of calculation and graphical representation employ log-log plots which do not permit negative values. Where linear plots are used instead for representing such possible negative values, it is found, however, that most of the values in the plots tend to congregate in a small area so that it is difficult to resolve them visually. It is, therefore, desirable to provide improved techniques for calculation and representation of data.

[0016] It is, therefore, desirable to provide an improved system for analyzing and representing data obtained from microarrays whereby the above-described difficulties are alleviated.

SUMMARY OF THE INVENTION

[0017] A new method, referred to herein as Significance Analysis of Microarrays (SAM), identifies genes with statistically significant differences in expression or other biological characteristics (such as gene copy number or levels of protein encoded by the genes), referred to below as values associated with the genes, by assimilating a set of gene-specific microarray data. For example, SAM may assign each gene a score representing such associated values, based on differences in gene expression or other biological characteristics in the data relative to the standard deviation of repeated measurements for that gene. Genes with scores greater than an adjustable threshold are deemed potentially significant. In some situations, gene expression may vary over a wide range of values, so that, in order to take full advantage of statistical analysis, it is preferable to choose statistical parameters for characterizing genes so that statistical significance can be assessed despite such variation of values. Preferably the parameters are chosen so that they are substantially independent of the ranges of values that characterize the genes. Thus, where a plurality of genes are associated with a plurality of sets of values obtained from data sources, a statistical parameter is provided that contains information concerning differences in the associated values of the genes among the sets. In one implementation, the parameters of the genes are adjusted so that the parameters are substantially independent of the average associated values of the genes over the sets. An observed value and an expected value of the adjusted parameter are calculated and compared to identify genes whose associated values differ by an amount of statistical significance among the sets. The sets of associated values of genes may be obtained from measurements using microarrays, data derived from such measurements, calculations or predictions using gene models, or other data sources.

[0018] As noted above, gene expression or other biological characteristics of genes may vary over a wide range of values. Therefore, for genes whose expression or other characteristics have high values, even a difference that is a small percentage of the high values may overshadow and mask larger relative differences for genes whose expression or other characteristics have lower values. Furthermore, factors inherent in the process of acquisition of the data analyzed may introduce noise that may mask changes or differences in gene expression, or cause genes to be erroneously identified as having changes of statistical significance. This problem can be alleviated by ranking the genes by their values of the parameter, and by deriving expected values of the parameter of different ranks. The expected value for the parameter for each rank is then compared with the value of the parameter of the gene of the same rank to identify genes that exhibit changes of statistical significance.

[0019] In one embodiment, the expected value for the parameter for each rank is obtained by permuting the associated values of genes, deriving a value of such parameter for each gene in each permutation, ranking the values of the parameter, and obtaining an average value of the parameter of each rank for the permutations.

[0020] Inherent in some statistical methods such as the one described above is that some genes may be erroneously identified as ones with statistically significant differences in expression or other characteristics. A good indication of the effectiveness of the method is to compute a false discovery rate for the method.

[0021] To estimate the percentage of such genes identified by chance (the false discovery rate, FDR), nonsense genes are identified by analyzing permutations of the measurements. The threshold score can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set.

[0022] The FDR may be found by permuting the associated values of genes, deriving a value of such parameter for each permutation, ranking the values of the parameter, and comparing the values of the parameter to a threshold to find the FDR. In one embodiment, this is implemented by counting the number of genes with parameter values that exceed a positive threshold or fall below a negative threshold. One possible method for estimating the FDR is to define FDR as the number of such nonsense genes divided by the number of actual genes with parameter values that exceed the positive threshold or fall below the negative threshold.
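The counting scheme just described may be sketched as follows. The function and argument names are hypothetical, and averaging the number of nonsense genes over the permutations is an assumption made for this illustration; the text itself states only that the number of nonsense genes is divided by the number of actual genes beyond the thresholds.

```python
def estimate_fdr(observed, permuted, pos_cut, neg_cut):
    """Estimate the false discovery rate for a pair of thresholds.

    observed: parameter values (one per gene) from the real data.
    permuted: list of lists, one list of parameter values per permutation.
    Returns (genes_called, mean_nonsense_genes, fdr); fdr is None when
    no gene is called, since the ratio is then undefined.
    """
    def count_beyond(values):
        # Genes above the positive threshold or below the negative one.
        return sum(1 for v in values if v > pos_cut or v < neg_cut)

    genes_called = count_beyond(observed)
    # Assumption: average the nonsense-gene count over all permutations.
    mean_nonsense = sum(count_beyond(p) for p in permuted) / len(permuted)
    if genes_called == 0:
        return 0, mean_nonsense, None
    return genes_called, mean_nonsense, mean_nonsense / genes_called


# Hypothetical values: two genes called in the real data; one nonsense
# gene appears in the first permutation and none in the second.
observed = [3.0, -2.5, 0.1, 0.2]
permuted = [[1.6, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
result = estimate_fdr(observed, permuted, pos_cut=1.5, neg_cut=-1.5)
```

Repeating the calculation for a series of thresholds gives one FDR per set of called genes, which corresponds to the adjustable threshold mentioned above.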

[0023] Where SAM is used in connection with data analysis of diseases, gene expression or other characteristic values may correlate with patient survival time. In such event, pairs of death and risk sets may be defined, each pair having a corresponding patient death time, where the death set includes associated values corresponding to the death time and the risk set includes values corresponding to times occurring after the death time. A parameter is then provided for each of the genes containing information concerning differences in the associated values of the gene among the sets. An observed and an expected value of the parameter for each gene are then derived and compared to identify genes that exhibit behavior of statistical significance.

[0024] To avoid the problem inherent in the conventional technique of using sharp thresholds in deriving representative values of genes, smooth weighting functions may be used to reduce distortion. In order to analyze and/or represent expression levels that may be negative or positive in value, odd root values may be analyzed and/or graphically displayed so that the values do not congregate in a small area in the plot, and this facilitates analysis and comparison.

[0025] The above-described features may be embodied as a program of instructions executable by computer to perform the above-described different aspects of the invention. Hence, any of the techniques described above may be performed by means of software components loaded into a computer or any other information appliance or digital device. When so enabled, the computer, appliance or device may then perform the above-described techniques to assist the analysis of sets of values associated with a plurality of genes in the manner described above, or for comparing such associated values. The software component may be loaded from a fixed media or accessed through a communication medium such as the internet or any other type of computer network. The above features embodied in one or more computer programs may be performed by one or more computers running such program(s).

[0026] Each of the inventive features described above may be used individually or in combination in different arrangements. All such combinations and variations are within the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0027] FIG. 1A is a linear scatter plot of gene expression in a sample hybridized to two microarrays using a conventional technique, where each gene (i) in the microarray is represented by a point with coordinates consisting of gene expression measured in uninduced cell line 1 from hybridization A, x_(U1A)(i), and gene expression from the same cell line from hybridization B, x_(U1B)(i).

[0028] FIG. 1B is a cube root scatter plot of gene expression from the data in FIG. 1A to illustrate an aspect of the invention.

[0029] FIG. 1C is a cube root scatter plot of average gene expression (avg x_(A)) from the four A hybridizations (induced and uninduced, cell lines 1 and 2) and the four similar B hybridizations (avg x_(B)) to illustrate an aspect of the invention.

[0030] FIG. 1D is a cube root scatter plot of average gene expression from the four hybridizations with uninduced cells (avg x_(U)) and induced cells 4 hr after exposure to 5 Gy of IR (avg x_(I)), where some of the genes that responded to IR are indicated by arrows to illustrate an aspect of the invention.

[0031] FIGS. 2A-2F are scatter plots of relative difference in gene expression d(i) versus gene specific scatter s(i), where the data were partitioned to calculate d(i) as indicated by the bar codes, and where the shaded and unshaded entries were used for the first and second terms in the numerator of d(i) in Equation 1 set forth below.

[0032] FIG. 2A illustrates the relative difference between irradiated and unirradiated states, where the statistic d(i) was computed from expression measurements partitioned between irradiated and unirradiated cells.

[0033] FIG. 2B illustrates the relative difference between cell lines 1 and 2, where the statistic d(i) was computed from expression measurements partitioned between cell lines 1 and 2.

[0034] FIG. 2C illustrates the relative difference between hybridizations A and B, where the statistic d(i) was computed from the permutation in which the expression measurements were partitioned between the equivalent hybridizations A and B.

[0035] FIGS. 2D, 2E, 2F illustrate the relative differences for three permutations of the data that were balanced between cell lines 1 and 2.

[0036] FIGS. 3A-3C illustrate a process for identification of genes with significant changes in expression.

[0037] FIG. 3A is a scatter plot of the observed relative difference d(i) versus the expected relative difference d_(E)(i), where the solid line at 45 degrees indicates the line for d(i)=d_(E)(i), where the observed relative difference is identical to the expected relative difference, and where the dotted lines are drawn at a distance Δ=1.2 from the solid line.

[0038] FIG. 3B is a scatter plot of d(i) versus scatter s(i).

[0039] FIG. 3C is a cube root scatter plot of average gene expression in induced and uninduced cells, where the cutoffs for 2-fold induction and repression are indicated by the dashed lines, and where in all panels, the 46 potentially significant genes for Δ=1.2 are indicated by the squares.

[0040] FIGS. 4A-4C illustrate a process for comparison of SAM to conventional methods for analyzing microarrays.

[0041] FIG. 4A illustrates falsely significant genes plotted against number of genes called significant, where of the 57 genes most highly ranked by the fold change method, 5 were included among the 46 genes most highly ranked by SAM.

[0042] FIG. 4B is a Northern blot validation for genes identified by the fold change method, where values of r(i) are plotted for genes chosen at random from the 57 genes most highly ranked by the fold change method.

[0043] FIG. 4C is a Northern blot validation for genes identified by SAM, where results are plotted for genes chosen at random from the 46 genes most highly ranked by SAM. The straight lines in FIGS. 4B and 4C indicate the position of exact agreement between Northern blot and microarray results.

[0044] FIG. 5A is a graphical plot of a scatter function to illustrate effects of a conventional technique for processing gene expression which eliminates contributions from probes that diverge from a mean value by a predetermined cutoff.

[0045] FIG. 5B is a graphical plot of a scatter function to illustrate effects of the use of a Gaussian weighting function for processing gene expression to illustrate an aspect of the invention.

[0046] FIG. 6 is a block diagram showing a representative sample logic device in which aspects of the present invention may be embodied.

[0047] For simplicity in description, identical components are labelled by the same numerals in this application.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0048] Because of its biological importance, SAM is applied to the transcriptional response of lymphoblastoid cells to ionizing radiation (IR). Although the data were obtained from oligonucleotide microarrays representing 6800 genes, SAM can also be applied to cDNA microarrays in a similar manner.

[0049] Materials and Methods Used in the Invention:

[0050] Preparation of RNA. Lymphoblastoid cell lines GM14660 and GM08925 (Coriell Cell Repositories, Camden, N.J.) were seeded at 2.5×10⁵ cells/ml and exposed to 5 Gy 24 hours later. RNA was isolated, labeled and hybridized to the HuGeneFL GeneChip® microarray according to manufacturer's protocols (Affymetrix, Santa Clara, Calif.).

[0051] Microarray hybridization. Each gene in the microarray was represented by 20 oligonucleotide pairs, each pair consisting of an oligonucleotide perfectly matched to the cDNA sequence and a second oligonucleotide containing a single base mismatch. Because gene expression was computed from differences in hybridization to the matched and mismatched probes, expression levels were sometimes reported by the GeneChip® Analysis Suite software as negative numbers. To compare data from different microarray hybridizations, a reference data set was constructed from the average expression for each gene over the 8 data sets. Gene expression for each hybridization was plotted against the reference data set in a cube root scatter plot and scaled by a linear fit to the data points. Data were then cubed to return values to the original scale.
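One possible reading of this scaling step is sketched below. The text does not specify the details of the linear fit, so the following are assumptions made only for illustration: an ordinary least-squares line with an intercept is fitted that predicts the cube root of the reference from the cube root of the hybridization, and the fitted values are then cubed. All function names are hypothetical.

```python
import math


def cube_root(value):
    """Real cube root that preserves the sign of negative expression values."""
    return math.copysign(abs(value) ** (1.0 / 3.0), value)


def scale_to_reference(hybridization, reference):
    """Scale one hybridization to the reference in cube root space.

    Assumption: least-squares fit of cube_root(reference) on
    cube_root(hybridization), followed by cubing the fitted values.
    """
    u = [cube_root(v) for v in hybridization]
    w = [cube_root(v) for v in reference]
    n = len(u)
    mean_u = sum(u) / n
    mean_w = sum(w) / n
    sxx = sum((a - mean_u) ** 2 for a in u)
    sxy = sum((a - mean_u) * (b - mean_w) for a, b in zip(u, w))
    slope = sxy / sxx
    intercept = mean_w - slope * mean_u
    # Cubing returns the scaled values to the original scale.
    return [(slope * a + intercept) ** 3 for a in u]


# Hypothetical data: a hybridization that is uniformly 8 times brighter
# than the reference is brought back onto the reference scale.
reference = [-8.0, 1.0, 27.0, 64.0]
hybridization = [8.0 * v for v in reference]
scaled = scale_to_reference(hybridization, reference)
```

Working in cube root space keeps the fit from being dominated by the few highly expressed genes while still accepting the negative values reported by the software.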

[0052] Northern blot hybridization. Total RNA (15 μg) was resolved by agarose gel electrophoresis, transferred to a nylon membrane, and hybridized to specific DNA probes, which were prepared by PCR amplification.

[0053] Results of Applying the Invention to a Biological System:

[0054] RNA was harvested from two wild type human lymphoblastoid cell lines, designated 1 and 2, growing in an unirradiated state U, or in an irradiated state I, 4 hr after exposure to a modest dose of 5 Gy of IR. RNA samples were labeled and divided into two identical aliquots for independent hybridizations, A and B. Thus, data was generated from eight hybridizations (U1A, U1B, U2A, U2B, I1A, I1B, I2A, I2B).

[0055] To assess reproducibility in the data, identical aliquots of an mRNA sample (U1A and U1B) were analyzed with two microarrays from the same manufacturing lot. A linear scatter plot for gene expression confirmed that the data was generally reproducible (FIG. 1A), but failed to resolve the vast majority of genes that are expressed at low levels. To better resolve these genes, we chose to display the data in a cube root scatter plot. This permitted the inclusion of negative levels of expression that are sometimes generated by the GeneChip® software. The cube root scatter plot (FIG. 1B) revealed three salient features: the large percentage of genes (24%) assigned negative levels of expression, the large percentage of genes with low levels of expression, and the low signal to noise ratio at low levels of expression.

[0056] FIG. 1A is a linear scatter plot of gene expression in a sample hybridized to two microarrays using a conventional technique, where each gene (i) in the microarray is represented by a point with coordinates consisting of gene expression measured in uninduced cell line 1 from hybridization A, x_(U1A)(i), and gene expression in the same cell line from hybridization B, x_(U1B)(i). As can be observed from FIG. 1A, only a small number of highly expressed genes are resolved visually, with most of the genes compressed into a small region of the plot so that they would be difficult to resolve visually. One method of distributing such data points more uniformly is a logarithmic scatter plot, but the log function cannot accept the negative values for gene expression generated by the microarrays. FIG. 1B is a cube root scatter plot of gene expression from the data in FIG. 1A to illustrate an aspect of the invention. As will be clear from a comparison of FIGS. 1A, 1B, the genes with lower expression levels are more visually resolved in the cube root plot of FIG. 1B compared to FIG. 1A. While cube root plots are illustrated herein, it will be understood that the fifth root or other odd root plots may be used instead and are within the scope of the invention.
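The odd root transform used for such plots may be sketched as follows. The function name is hypothetical. The sign is handled explicitly because, in Python, raising a negative float to a fractional power yields a complex number rather than the real odd root.

```python
import math


def signed_odd_root(value, root=3):
    """Real odd root (cube root by default) that accepts negative values.

    `root` must be an odd positive integer, for example 3 or 5.
    """
    if root < 1 or root % 2 == 0:
        raise ValueError("root must be an odd positive integer")
    return math.copysign(abs(value) ** (1.0 / root), value)


# Hypothetical expression values spanning negative, low, and high levels.
expression = [-125.0, -1.0, 0.0, 8.0, 1000.0, 27000.0]
cube_roots = [signed_odd_root(v) for v in expression]
```

In this example a range of roughly 27,000 units on the linear scale is compressed to about 35 units, so the low and negative values occupy a visible share of the axis instead of collapsing near the origin.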

[0057] After scaling the data from different microarray hybridizations, a scatter plot was generated for average gene expression in the four A aliquots vs. the average in the four B aliquots, a partitioning of the data that eliminates biological changes in gene expression. The scatter was improved by averaging multiple data sets (compare FIGS. 1B and 1C). FIG. 1C is a cube root scatter plot of average gene expression from the four A hybridizations (avg x_(A)) and the four B hybridizations (avg x_(B)).

[0058] To assess the biological effect of IR, a scatter plot was generated for average gene expression in the four irradiated states vs. the four unirradiated states (compare FIGS. 1C and 1D). FIG. 1D is a cube root scatter plot of average gene expression from the four hybridizations with uninduced cells (avg x_(U)) and induced cells 4 hr after exposure to 5 Gy of IR (avg x_(I)), where some of the genes that responded to IR are indicated by arrows to illustrate an aspect of the invention. A few of the potentially significant changes in gene expression are indicated by arrows in FIG. 1D, but the effect was not easily quantified, and it is desirable to provide a better method to identify changes with a level of statistical confidence.

[0059] The approach adopted herein was based on analysis of random fluctuations in the data. In general, the signal to noise ratio decreased with decreasing gene expression (FIGS. 1B-1D). However, even for a given level of expression, it is found that fluctuations were gene-specific. To account for gene-specific fluctuations, a statistic is defined based on the ratio of change in gene expression to standard deviation in the data for that gene. The “relative difference” d(i) in gene expression is:

$$d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0} \qquad (1)$$

[0060] where x̄_(I)(i) and x̄_(U)(i) are defined as the average levels of expression for gene (i) in states I and U, respectively. The “gene-specific scatter” s(i) is the standard deviation in the data:

$$s(i) = \sqrt{\frac{\left(1/n_1 + 1/n_2\right)\left\{\sum_m \left[x_m(i) - \bar{x}_I(i)\right]^2 + \sum_n \left[x_n(i) - \bar{x}_U(i)\right]^2\right\}}{n_1 + n_2 - 2}} \qquad (2)$$

[0061] where Σ_(m) and Σ_(n) are summations of the expression measurements in states I and U, respectively, and n₁ and n₂ are the numbers of measurements in states I and U (4 in this experiment). A constant s₀=3.3 was chosen by minimizing the coefficient of variation of the standard deviation of d(i) as a function of s(i), thus permitting d(i) values to be compared among all genes in the microarray. While a relative difference parameter d(i) as set forth in equation (1) is preferable, it will be understood that other difference functions that depend on the differences between the associated values of the genes among the sets (e.g. set of measurements in state U and set of same in state I) and on scatter values among the sets may be used and are within the scope of the invention.
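Equations (1) and (2) may be sketched for a single gene as follows. The function names are hypothetical; the default s₀=3.3 is the value reported above for this experiment and would need to be re-derived for other data. The denominator of the relative difference is taken as the gene-specific scatter plus s₀, in keeping with the description of d(i) as a change in expression relative to the standard deviation for that gene.

```python
import math


def gene_scatter(x_induced, x_uninduced):
    """Gene-specific scatter s(i) of equation (2) for one gene."""
    n1, n2 = len(x_induced), len(x_uninduced)
    mean_i = sum(x_induced) / n1
    mean_u = sum(x_uninduced) / n2
    sum_sq = (sum((x - mean_i) ** 2 for x in x_induced)
              + sum((x - mean_u) ** 2 for x in x_uninduced))
    return math.sqrt(sum_sq * (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2))


def relative_difference(x_induced, x_uninduced, s0=3.3):
    """Relative difference d(i) of equation (1) for one gene."""
    mean_i = sum(x_induced) / len(x_induced)
    mean_u = sum(x_uninduced) / len(x_uninduced)
    return (mean_i - mean_u) / (gene_scatter(x_induced, x_uninduced) + s0)


# Hypothetical measurements for one gene, four per state as in the text.
x_i = [10.0, 12.0, 14.0, 16.0]
x_u = [2.0, 4.0, 6.0, 8.0]
s_value = gene_scatter(x_i, x_u)
d_value = relative_difference(x_i, x_u)
```

For these values the averages differ by 8 and the pooled sum of squares is 40, so s(i) is the square root of 40 × (1/4 + 1/4) / 6, or about 1.83, and d(i) is 8 divided by about 5.13, or roughly 1.56.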

[0062] As noted above, factors inherent in the process of acquisition of microarray data itself may introduce noise that renders it difficult to discover the significance of differences in gene expression or other biological behavior or falsely identify genes to be of statistical significance. To overcome such problem, a number of methods are described above which allow full utilization of the microarray data. One difficulty in making use of the microarray data is due to the fact that the expression levels of the genes have a wide range of values or scattered values. It is, therefore, desirable to adjust the parameter d(i) so that it is essentially independent of the wide variation of the values of the parameter d(i) and/or of the scatter value s(i). After the parameter has been so adjusted, then all of the data can be fully utilized.

[0063] In one embodiment, the adjustment is accomplished by dividing the scatter values or average associated values of the genes into subsets each having a similar range of values. For example, the scatter values or average associated values of the genes may be divided into ten subsets in accordance with which percentile such values fall into. In other words, the first of the ten subsets will contain the top tenth percentile of the scatter values or average associated values of the genes, the second subset containing the second to the top tenth percentile of such values and so on. The standard deviation of the parameter d(i) is then calculated within each subset and a coefficient of variation of the standard deviations of the parameter values for the ten subsets is then minimized by varying the value of the constant s₀ appearing in equation 1. After the constant s₀ has been so adjusted, the parameter d(i) is then substantially independent of wide variations in scatter values or average associated values of the genes, so that all of the microarray data can be effectively used.
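The selection of s₀ described in this embodiment may be sketched as follows. This is an illustration under several assumptions: genes are binned by their scatter values into subsets of equal size, the minimization is a simple search over a caller-supplied list of candidate values, and the sample standard deviation is used throughout. The function names and the candidate grid are hypothetical.

```python
import statistics


def cv_of_binned_sd(diffs, scatters, s0, n_bins=10):
    """Coefficient of variation, across scatter-ranked subsets, of the
    standard deviation of d(i) = diffs[i] / (scatters[i] + s0).

    diffs: numerator of equation 1 for each gene.
    scatters: gene-specific scatter s(i) for each gene.
    """
    order = sorted(range(len(scatters)), key=lambda i: scatters[i])
    size = len(order) // n_bins
    if size < 2:
        raise ValueError("need at least two genes per subset")
    sds = []
    for b in range(n_bins):
        start = b * size
        stop = (b + 1) * size if b < n_bins - 1 else len(order)
        d_values = [diffs[i] / (scatters[i] + s0) for i in order[start:stop]]
        sds.append(statistics.stdev(d_values))
    return statistics.stdev(sds) / statistics.mean(sds)


def choose_s0(diffs, scatters, candidates, n_bins=10):
    """Return the candidate s0 giving the smallest coefficient of variation."""
    return min(candidates,
               key=lambda s0: cv_of_binned_sd(diffs, scatters, s0, n_bins))
```

A call such as `choose_s0(diffs, scatters, [0.0, 1.0, 2.0, 4.0, 8.0])` returns the candidate for which the spread of d(i) is most nearly the same in every subset, which is the sense in which d(i) becomes substantially independent of the scatter values.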

[0064] Scatter plots of d(i) vs. log[s(i)] are shown in FIGS. 2A-2F, which are scatter plots of relative difference in gene expression d(i) versus gene specific scatter s(i), where the data were partitioned to calculate d(i) as indicated by the bar codes, and where the shaded and unshaded entries were used for the first and second terms in the numerator of d(i) in Equation 1 set forth above. FIG. 2A illustrates the relative difference between irradiated and unirradiated states, where the statistic d(i) was computed from expression measurements partitioned between irradiated and unirradiated cells. By contrast, the scatter plot for relative difference between cell lines 1 and 2 shows more marked changes in FIG. 2B, which illustrates the relative difference between cell lines 1 and 2. In FIG. 2B, the statistic d(i) was computed from expression measurements partitioned between cell lines 1 and 2. Thus, the relative difference between cell lines 1 and 2 appears to exceed that between irradiated and unirradiated states.

[0065] These relative differences exceeded random fluctuations in the data, as measured by the relative difference between hybridizations A and B in FIG. 2C which illustrates the relative difference between hybridizations A and B. In FIG. 2C, the statistic d(i) was computed from the permutation in which the expression measurements were partitioned between the equivalent hybridizations A and B.

[0066] Although the relative difference computed from hybridizations A and B provided a control for random fluctuations, additional controls were desirable to assign statistical significance to the biological effect of IR. Instead of performing more experiments, which are expensive and labor-intensive, a large number of controls are generated by computing relative differences from permutations of the hybridizations for the 4 irradiated and 4 unirradiated states. To minimize potentially confounding effects from differences between the two cell lines, the data was analyzed using the 36 permutations that were balanced for cell lines 1 and 2. Permutations were defined as balanced when each group of four experiments contained two experiments from cell line 1 and two experiments from cell line 2. FIGS. 2D, 2E, 2F illustrate the relative differences for three permutations of the data that were balanced between cell lines 1 and 2.
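The count of 36 balanced permutations can be checked by enumeration, as sketched below. The function names are hypothetical, and the sketch assumes that the second character of each hybridization label identifies the cell line, as in the labels U1A through I2B used above.

```python
from itertools import combinations

HYBRIDIZATIONS = ["U1A", "U1B", "U2A", "U2B", "I1A", "I1B", "I2A", "I2B"]


def cell_line(label):
    """Cell line encoded in a label such as 'U1A' (assumed naming scheme)."""
    return label[1]


def balanced_permutations(labels=HYBRIDIZATIONS):
    """Enumerate ordered splits of the hybridizations into two groups of
    four, keeping those in which the first group holds two hybridizations
    from each cell line (the second group is then balanced as well)."""
    splits = []
    for first_group in combinations(labels, 4):
        from_line_1 = sum(1 for h in first_group if cell_line(h) == "1")
        if from_line_1 == 2:
            second_group = tuple(h for h in labels if h not in first_group)
            splits.append((first_group, second_group))
    return splits


splits = balanced_permutations()
```

Choosing two of the four cell line 1 hybridizations and two of the four cell line 2 hybridizations for the first group gives 6 × 6 = 36 splits. The enumeration includes the actual irradiated versus unirradiated partition, and each split is counted separately from its mirror image, which only reverses the sign of d(i).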

[0067] Relative differences from random permutations of thehybridizations indicate noise inherent in the process of dataacquisition. From the examples illustrated above, it is seen thatrelative differences stemming from the differences between cell linesmay mask statistically significant changes in gene expression caused byradiation, so that for this reason, it may be preferable to use onlydata from balanced permutations, to reduce the effects on the statisticsfrom differences between the cell lines.

[0068] Another control that can be exerted is by ranking the values ofthe relative difference parameter d(i) Although gene expression levelscan vary widely, the relative difference d(i) is a measure ofstatistical significance substantially independent of expression level.As another control for assigning statistical significance, the largestrelative differences from the 36 permutations may indicate noise fromstatistical fluctuations in the data. One may compute the average valueof the largest relative differences from all 36 permutations. Thus,comparing the largest relative difference among all the genes to thelargest relative differences from the permutations provides one possibletest for identifying genes to be of statistical significance. Therefore,the average of the largest relative differences from the 36 permutationsis the expected relative difference for such gene. A comparison of therelative difference of such gene with its expected value can be used ascontrol as to whether statistical significance should be assigned tosuch gene. The same reasoning applies to the gene of the second highestrelative difference and comparison to the second largest relativedifferences from the permutations, and so on for all the genes involvedin the calculation.

[0069] In other words, to find significant changes in gene expression, genes were ranked by the magnitude of their d(i) values, so that d(1) is the largest relative difference, d(2) is the second largest relative difference, and d(i) is the i^(th) largest relative difference. For each of the 36 balanced permutations, relative differences d_(p)(i) were also calculated, and the genes were again ranked such that d_(p)(i) was the i^(th) largest relative difference for permutation p. The expected relative difference d_(E)(i) was defined as the average over the 36 balanced permutations,

$d_{E}(i) = \sum_{p} d_{p}(i)/36$   (3)
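The rank-by-rank averaging of equation (3) may be sketched as follows. This is a minimal illustration that assumes the permuted relative differences have already been computed and stored as an array with one row per permutation; the function name and array layout are assumptions, and the divisor is simply the number of rows (36 in the example described above).

```python
import numpy as np

def expected_relative_difference(d_perm):
    """d_perm: array of shape (n_permutations, n_genes) holding d_p(i).

    Each permutation is ranked separately from largest to smallest, and
    the values sharing a rank are averaged over the permutations.
    Returns d_E ordered from largest to smallest."""
    d_perm = np.asarray(d_perm, dtype=float)
    ranked = -np.sort(-d_perm, axis=1)   # largest first within each row
    return ranked.mean(axis=0)           # average over permutations, per rank
```

For instance, two permutations with relative differences [1, 3, 2] and [0, −1, 4] give ranked rows [3, 2, 1] and [4, 0, −1], and hence expected values [3.5, 1.0, 0.0].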

[0070] FIGS. 3A-3C illustrate a process for identification of genes with significant changes in expression. FIG. 3A is a scatter plot of the observed relative difference d(i) versus the expected relative difference d_(E)(i), in which the solid line indicates the line for d(i)=d_(E)(i), where the observed relative difference is identical to the expected relative difference, and in which the dotted lines are drawn at a distance Δ=1.2 from the solid line. FIG. 3B is a scatter plot of d(i) versus scatter s(i). FIG. 3C is a cube root scatter plot of average gene expression in induced and uninduced cells, where the cutoffs for 2-fold induction and repression are indicated by the dashed lines. In all panels, the 46 potentially significant genes for Δ=1.2 are indicated by the squares.

[0071] To identify potentially significant changes in expression, a scatter plot of the observed relative difference d(i) vs. the expected relative difference d_(E)(i) (FIG. 3A) is used. For the vast majority of genes, d(i)≅d_(E)(i). However, some genes are represented by points displaced from the d(i)=d_(E)(i) line by a distance greater than a threshold Δ. For example, the threshold Δ=1.2 illustrated by the broken lines in FIG. 3A yielded 46 genes that were “called significant.” These 46 genes are shown in the context of the scatter plot for d(i) vs. log[s(i)] (FIG. 3B) and in the scatter plot for the cube root of gene expression x _(I)(i) vs. x _(U)(i) (FIG. 3C). Clearly, genes identified by d(i) do not necessarily have the largest changes in gene expression.
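A simplified sketch of this thresholding step is given below. It assumes that the "distance" from the d(i)=d_(E)(i) line is measured vertically, as the difference between the ranked observed value and the expected value of the same rank, and that the expected values are supplied from largest to smallest; both points, and the function name, are assumptions made for illustration.

```python
import numpy as np

def call_significant(d, d_expected, delta):
    """Return the indices (into d) of genes whose ranked d(i) departs
    from the expected d_E(i) of the same rank by more than delta.

    d          : observed relative differences, one per gene, any order
    d_expected : expected relative differences, largest first
    delta      : threshold on |d(i) - d_E(i)|"""
    d = np.asarray(d, dtype=float)
    order = np.argsort(-d)                                   # largest d first
    gap = d[order] - np.asarray(d_expected, dtype=float)     # observed - expected
    return order[np.abs(gap) > delta]
```

With d = [0.1, 5.0, −4.0, −0.2], expected values [2.0, 0.5, −0.5, −2.0], and Δ = 1.2, the genes at indices 1 (induced) and 2 (repressed) would be called significant.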

[0072] As noted above, the relative differences of the various permutations indicate noise inherent in the data acquisition process. Such relative differences may then be used to determine the number of genes falsely identified to be of statistical significance. The false discovery rate may be found by comparing such relative differences to thresholds. FIG. 3A may be used for such purposes as well, where the “observed” relative difference d(i) in the figure is one obtained from permutations as described below.

[0073] In one embodiment, to determine the number of falsely significant genes generated by SAM, horizontal cutoffs were defined as the smallest d(i) among the genes called significantly induced and the least negative d(i) among the genes called significantly repressed. The number of falsely significant genes corresponding to each permutation was computed by counting the number of genes that exceeded the horizontal cutoffs for induced and repressed genes. The estimated number of falsely significant genes was the average of the number of genes called significant from all 36 permutations. Table 1, attached hereto as Appendix A and made part of this application, shows the results for different values of Δ. For Δ=1.2, the permuted data sets generated an average of 8.4 falsely significant genes, compared to 46 genes called significant, yielding an estimated FDR of 18%. As Δ decreased, the number of genes called significant by SAM increased, but at the cost of an increasing FDR. (Omitting s₀ from Equation 1 produced higher FDRs of 45%, 35%, and 28% for Δ=0.6, 0.9, and 1.2, respectively.)

[0074] Thus, as illustrated in FIG. 3A, the “observed” relative difference d(i) is plotted against the expected relative difference d_(E)(i) for all of the 36 permutations. To arrive at the plot in FIG. 3A, both the “observed” and the expected relative differences are computed using the associated values of the genes in the 36 permutations using equations (1)-(3) above.

[0075] One then starts from point 12 (at coordinates (0,0)) in the plot and proceeds along line 14, at 45° to the axis, in the positive direction along arrow 16. When the smallest positive “observed” relative difference d(i) is encountered that exceeds the expected relative difference d_(E)(i) by a set threshold defined by dotted line 17, such as at point 20, that value of d(i) is set as a horizontal cutoff 22. The number of genes with positive “observed” relative difference values exceeding cutoff 22 from the 36 permutations, compared to the unpermuted data, then provides an indication of the false discovery rate for induced genes. This accounts for the falsely significant genes that are induced.

[0076] To discover the falsely significant genes that are repressed, one proceeds again from point 12 along line 14, but in the negative direction 18, until one encounters at point 30 the least negative observed relative difference d(i) that departs from the expected relative difference d_(E)(i) by a set threshold indicated by dotted line 19. This least negative d(i) is then set as the negative horizontal cutoff 32. The genes whose relative differences are more negative than horizontal cutoff 32 in the permuted and unpermuted data are used to estimate the FDR.
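The FDR estimate described above can be sketched as follows. The sketch assumes that the genes called significant have already been identified by their indices, that the permuted relative differences are stored with one row per permutation, and that a permuted value equal to a cutoff is counted as exceeding it; the function name and these conventions are illustrative assumptions.

```python
import numpy as np

def estimate_fdr(d, significant, d_perm):
    """Estimate the number of falsely significant genes and the FDR.

    d           : observed relative differences, one per gene
    significant : indices of the genes called significant
    d_perm      : array (n_permutations, n_genes) of permuted d_p(i)"""
    d = np.asarray(d, dtype=float)
    significant = np.asarray(significant, dtype=int)
    d_perm = np.asarray(d_perm, dtype=float)
    if significant.size == 0:
        return 0.0, float("nan")      # no genes called, FDR undefined
    sig = d[significant]
    # Horizontal cutoffs: smallest positive and least negative called d(i).
    upper = sig[sig > 0].min() if (sig > 0).any() else np.inf
    lower = sig[sig < 0].max() if (sig < 0).any() else -np.inf
    # Count, per permutation, the genes beyond either cutoff.
    false_counts = ((d_perm >= upper) | (d_perm <= lower)).sum(axis=1)
    n_false = float(false_counts.mean())
    return n_false, n_false / significant.size
```

The two cutoffs are computed independently, so they need not have the same magnitude. With the figures quoted above for Δ=1.2, the final ratio is 8.4/46, or about 18%.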

[0077] To test the above-described method for determining the FDR, artificial data sets were constructed in which a subset of genes was induced over a background of noise. When SAM was used to analyze such data sets, the estimated FDR accurately predicted the correct number of falsely significant genes.

[0078] The above method for setting thresholds provides asymmetric cutoffs for induced and repressed genes. In other words, the magnitudes of the two horizontal cutoffs 22, 32 need not be the same. The alternative is the standard t-test, which imposes a symmetric horizontal cutoff, with d(i)>c for induced genes and d(i)<−c for repressed genes. However, the asymmetric cutoff is preferred because it allows for the possibility that d(i) for induced and repressed genes may behave differently in some biological experiments.

[0079] FIGS. 4A-4C illustrate a comparison of SAM to conventional methods for analyzing microarrays. FIG. 4A illustrates falsely significant genes plotted against the number of genes called significant. Of the 57 genes most highly ranked by the fold change method, 5 were included among the 46 genes most highly ranked by SAM. Of the 38 genes most highly ranked by the pairwise fold change method, 11 were included among the 46 genes most highly ranked by SAM. These results were consistent with the FDRs of SAM compared to the fold change and pairwise fold change methods.

[0080] FIG. 4B is a Northern blot validation for genes identified by the fold change method, where values of r(i) are plotted for genes chosen at random from the 57 genes most highly ranked by the fold change method. The genes are: cyclin F (1); parathymosin (2); N-acetylglucosaminyltransferase (3); eIF-4 gamma (4); dynamin (5); interferon consensus sequence binding protein (6); heart muscle specific protein DRAL/SLIM3/FHL-2 (7); U1 snRNP-specific C protein (8); and maxi K potassium channel beta subunit (9).

[0081] FIG. 4C is a Northern blot validation for genes identified by SAM, where results are plotted for genes chosen at random from the 46 genes most highly ranked by SAM: maxi K potassium channel beta subunit (9); cyclin B (10); PLK (11); ckshs2 (12); IL2 receptor beta chain (13); PTP(CAAX1) (14); p48 (15); XPC (16); Fas (17); and mdm2 (18).

[0082] SAM proved to be superior to conventional methods for analyzing microarrays (Table 1 and FIG. 4A). First, SAM was compared to the approach of identifying genes as significantly changed if an R-fold change was observed. In this “fold change” method, r(i)=x _(I)(i)/x _(U)(i), and gene (i) was called significantly changed if r(i)>R or r(i)<1/R. To permit computation of r(i) from negative values for gene expression, x _(I)(i) and x _(U)(i) were converted to 10 when their values were negative or less than 10. The results of this procedure yielded unacceptably high FDRs of 73% to 84%.
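For concreteness, the fold change rule just described may be sketched as follows, taking the average expression values in the induced and uninduced states as inputs. The function name and argument layout are assumptions; the floor of 10 is the value stated above.

```python
import numpy as np

def fold_change_calls(x_induced, x_uninduced, R, floor=10.0):
    """Return (r, called): the fold change r(i) and a boolean mask of the
    genes called significantly changed under the "fold change" rule.

    Values that are negative or below `floor` are raised to `floor`
    before the ratio is taken."""
    xi = np.maximum(np.asarray(x_induced, dtype=float), floor)
    xu = np.maximum(np.asarray(x_uninduced, dtype=float), floor)
    r = xi / xu
    return r, (r > R) | (r < 1.0 / R)
```

For example, with R=2, induced values [−5, 100, 30] and uninduced values [40, 45, 30] give ratios of 0.25, about 2.22, and 1.0, so the first two genes are called and the third is not.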

[0083] Another approach attempts to account for uncertainty in the data by identifying genes as significantly changed if an R-fold change is observed consistently between paired samples (7). To apply this “pairwise fold change” method to our 4 data sets before and 4 data sets after IR, changes in gene expression were declared significant if 12 of 16 pairings satisfied the criteria r(i)>R or r(i)<1/R. Despite the demand for consistent changes between paired samples, this method yielded FDRs of 60% to 71%.
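A minimal sketch of the pairwise rule follows. Several details are assumptions made only for illustration: that "12 of 16" means at least 12 pairings, that the 12 pairings must all change in the same direction, and that the same floor of 10 is applied before ratios are taken. The function name is hypothetical.

```python
import numpy as np

def pairwise_fold_change_calls(after, before, R, min_pairs=12, floor=10.0):
    """after, before: arrays of shape (n_genes, 4), one column per data set.

    Forms all 4 x 4 = 16 after/before pairings for each gene and calls the
    gene significant when at least `min_pairs` pairings show r > R, or at
    least `min_pairs` pairings show r < 1/R."""
    a = np.maximum(np.asarray(after, dtype=float), floor)
    b = np.maximum(np.asarray(before, dtype=float), floor)
    ratios = a[:, :, None] / b[:, None, :]        # shape (n_genes, 4, 4)
    up = (ratios > R).sum(axis=(1, 2))
    down = (ratios < 1.0 / R).sum(axis=(1, 2))
    return (up >= min_pairs) | (down >= min_pairs)
```

A gene whose four post-IR values are all 2.5 times its four pre-IR values satisfies all 16 pairings for R=2, whereas a gene for which only two post-IR data sets show the change satisfies 8 pairings and is not called.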

[0084] To understand why fold-change methods fail, note that the vast majority of genes are expressed at low levels where the signal to noise ratio is very low (FIG. 3C). Thus, 2-fold changes in gene expression occur at random for a large number of genes. Conversely, for higher levels of expression, smaller changes in gene expression may be real, but these changes are rejected by fold-change methods. The pairwise fold change method provided modest improvement but remained inferior to SAM.

[0085] Of the 46 genes most highly ranked by SAM (Δ=1.2), 36 increased or decreased at least 1.5-fold with r(i)≧1.5 or r(i)≦0.67. The number of falsely significant genes that met these two criteria was 4.5, corresponding to an FDR of 12%. Fas was identified 3 times as alternatively spliced forms, leaving 34 independent genes. As an indication of biological validity, 10 of the 34 genes have been reported in the literature as part of the transcriptional response to IR. TNF-α was reported to be induced by others under different conditions (8) but was repressed here. We validated our microarray result by Taq-Man PCR.

[0086] To test the validity of SAM directly, Northern blots were performed for genes that were randomly selected from the 46 and 57 genes most highly ranked by SAM (Δ=1.2) and the fold change method (at least 3.6-fold change), respectively. Northern blots showed little correlation with the genes identified by the fold change method (FIG. 4B), but strong correlation with the genes identified by SAM (FIG. 4C). Indeed, Northern blots contradicted only 1 of 10 genes identified by SAM, consistent with our estimated FDR.

[0087] Nineteen of the 34 genes most highly ranked by SAM appear to be involved in the cell cycle. Three are known to be induced in a p53-dependent manner: p21, cyclin G1, and mdm2 (9-11). Six cell cycle genes were repressed: ubiquitin carrier protein E2-EPF, p55cdc, cyclin B, ckshs2, cdc25 phosphatase, and wee1 kinase (12, 13). Five genes encoding the mitotic machinery were also repressed: PLK kinase, mitotic kinesin-like protein 1 (MKLP-1), mitotic centromere-associated kinesin (MCAK), cdc25 associated protein kinase (CTAK1), and the kinetochore motor CENP-E (14-16). Four genes involved in cell proliferation were induced or repressed: the farnesylated protein tyrosine phosphatase PTP(CAAX1), OX40 ligand, lymphocyte phosphatase associated phosphoprotein (LPAP), and c-myc (17-21). Some responses were paradoxical. For example, cdc25 phosphatase and wee1 kinase have antagonistic effects on the phosphorylation state of cdc2, but both genes were repressed. Repression of these genes together with the mitotic genes may represent a damage response that dismantles the cell cycle machinery until the cell has repaired the damaged DNA.

[0088] Four of the 34 genes play roles in DNA repair, but none are involved in the repair of IR-induced double-strand breaks. Instead, the genes (p48, XPC, gadd45, PCNA) have roles in nucleotide excision repair, a pathway conventionally associated with UV-induced damage (22-25). We confirmed the induction of these genes by Northern blot (26-28). Fornace et al. reported defective removal of base damage induced by IR in xeroderma pigmentosum cells (29). Leadon et al. reported that a novel DNA repair pathway involving long excision repair patches of at least 150 nucleotides is activated by IR, but not UV (30). Our results suggest that this novel pathway might include p48, XPC, gadd45, and PCNA.

[0089] Three of the 34 genes play roles in apoptosis (Fas, bcl-2 binding component 3, TNF-α). The remaining genes may have previously unsuspected roles in the DNA damage response, or may be among the estimated set of four falsely detected genes. Attached hereto as Appendix B and made a part of this application is Table 2, which sets forth the genes with changes in expression called significant by SAM.

[0090] Discussion

[0091] The 34 genes most highly ranked by SAM are only a subset of all the genes that change 1.5-fold with IR. The difference between the number of genes called significant and the number of falsely significant genes was calculated for decreasing values of Δ=0.3, 0.2 and 0.1, and was found to be 92, 170, and 184, respectively. Thus SAM suggests that at least 180 genes are induced or repressed by 5 Gy IR.

[0092] In conclusion, SAM successfully identified those genes on a microarray with bona fide changes in expression. Here, SAM found genes whose expression changed between two states. SAM can also be generalized to other types of experiments by expressing d(i) in other ways. Suppose the data include gene expression x_(j)(i) and a response parameter y_(j), in which i=1, 2, . . . , m genes, j=1, 2, . . . , n samples. The generalized statistical parameter still takes the form d(i)=r(i)/[s(i)+s₀]. Only the definitions of r(i) and s(i) change. For example, r(i) can be correlated with factors other than irradiation, such as different types of tumors or survival time, as described in more detail below, where r(i) simply indicates relative differences in associated values, not necessarily those caused by changes due to radiation.

[0093] To identify genes whose expression is specifically different in a subset of a set of samples, the parameter d(i) is defined in terms of Fisher's linear discriminant. One goal might be to identify genes whose expression in one type of tumor is different from its expression in other types of tumors. Suppose that a set of n samples consists of K non-overlapping subsets, with y_(j) ∈ {1, . . . , K}. Define C(k)={j:y_(j)=k}. Let n_(k)=number of observations in C(k). The average gene expression in each subset is x _(k)(i)=Σ_(j∈C(k)) x_(j)(i)/n_(k) and the average gene expression for all n samples is x (i)=Σ_(j) x_(j)(i)/n. Then define:

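$r(i) = \left\{ \left[ \sum_{k} n_{k} / \prod_{k} n_{k} \right] \sum_{k} n_{k} \left[ \bar{x}_{k}(i) - \bar{x}(i) \right]^{2} \right\}^{1/2}$   (4)

$s(i) = \left\{ \left[ \sum_{k} (1/n_{k}) / \sum_{k} (n_{k}-1) \right] \sum_{k} \sum_{j \in C(k)} \left[ x_{j}(i) - \bar{x}_{k}(i) \right]^{2} \right\}^{1/2}$   (5)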

[0094] The quantity r(i) in equation 4 is the variance between subsets, and the quantity s(i) in equation 5 is the sum of variances within each subset. Each subset may be data collected from a type of tumor. Thus a large value for the generalized statistical parameter d(i) indicates a difference in gene expression between subsets, or between the different types of tumors. The value of s₀ in d(i) is adjusted in a manner similar to that above by permuting the parameter k among the tumor subsets.
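The multi-subset quantities may be sketched numerically as follows. The sketch assumes that the between-subset term in r(i) is weighted by n_(k), as in the Fisher discriminant of claim 19, so that for two subsets r(i) reduces to the absolute difference between the two subset averages; the function name and array layout are assumptions.

```python
import numpy as np

def multiclass_r_s(x, y):
    """x: expression array of shape (n_genes, n_samples).
    y: subset label for each sample (length n_samples).
    Returns (r, s), one value per gene, following equations (4) and (5)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_k = np.array([(y == k).sum() for k in classes], dtype=float)
    grand_mean = x.mean(axis=1)
    between = np.zeros(x.shape[0])
    within = np.zeros(x.shape[0])
    for k, n in zip(classes, n_k):
        xk = x[:, y == k]
        mean_k = xk.mean(axis=1)
        between += n * (mean_k - grand_mean) ** 2
        within += ((xk - mean_k[:, None]) ** 2).sum(axis=1)
    r = np.sqrt(n_k.sum() / n_k.prod() * between)
    s = np.sqrt((1.0 / n_k).sum() / (n_k - 1).sum() * within)
    return r, s
```

For a single gene with values [1, 3] in one subset and [5, 7] in another, the subset averages are 2 and 6, giving r = 4 and s = √2. The parameter d(i) then follows as r(i)/[s(i)+s₀].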

[0095] Thus, in general, where the associated values in each set can be classified into two or more subsets with values in each subset having a correlation with one another, a parameter may be selected using a quantity related to variances between the associated values in the subsets of the sets and the variances of the associated values within each subset of the sets. The quantity may relate to the sum of the variances between the associated values in the subsets of the sets and the sum of variances of the associated values within each subset of the sets.

[0096] To identify genes whose expression correlates with survival time, d(i) is defined in terms of Cox's proportional hazards function. Express the response data in the form y_(j)=(t_(j), δ_(j)). Here, t_(j)=survival time for patient (j) or censored survival time if the patient is still alive or lost to follow-up, and δ_(j)=0 or 1, depending on whether patient (j) was censored (δ_(j)=0) or died with a known survival time t_(j) (δ_(j)=1). Assume that there are K unique death times z₁, z₂, . . . , z_(K). Let D(k), for k=1, . . . , K, be death sets D(k)={j:t_(j)=z_(k)}. Let R(k) be risk sets R(k)={j:t_(j)≧z_(k)}. Let m_(k)=number of patients in R(k). Let d_(k)=number of deaths at time z_(k). The average expression of gene (i) in death set D(k) is: x _(k)*(i)=Σ_(j∈D(k)) x_(j)(i)/d_(k). The average expression of gene (i) in risk set R(k) is: x _(k)(i)=Σ_(j∈R(k)) x_(j)(i)/m_(k). Then define:

$r(i) = \sum_{k} d_{k} \left[ \bar{x}_{k}^{*}(i) - \bar{x}_{k}(i) \right]$   (6)

$s(i) = \left\{ \sum_{k} (d_{k}/m_{k}) \sum_{j \in R(k)} \left[ x_{j}(i) - \bar{x}_{k}(i) \right]^{2} \right\}^{1/2}$   (7)
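Equations (6) and (7) may be sketched as follows. The sketch assumes that a death set contains only patients who died (δ=1) at the corresponding death time, and that a risk set contains every patient, censored or not, whose time is at least that death time; the function name and array layout are assumptions.

```python
import numpy as np

def cox_r_s(x, time, status):
    """x: expression array of shape (n_genes, n_patients).
    time: survival or censoring time per patient.
    status: 1 if the patient died at `time`, 0 if censored.
    Returns (r, s), one value per gene."""
    x = np.asarray(x, dtype=float)
    time = np.asarray(time, dtype=float)
    status = np.asarray(status)
    r = np.zeros(x.shape[0])
    var = np.zeros(x.shape[0])
    for z in np.unique(time[status == 1]):        # unique death times z_k
        dead = (time == z) & (status == 1)        # death set D(k)
        risk = time >= z                          # risk set R(k)
        d_k, m_k = dead.sum(), risk.sum()
        mean_dead = x[:, dead].mean(axis=1)
        mean_risk = x[:, risk].mean(axis=1)
        r += d_k * (mean_dead - mean_risk)
        var += (d_k / m_k) * ((x[:, risk] - mean_risk[:, None]) ** 2).sum(axis=1)
    return r, np.sqrt(var)
```

As a small check, one gene with expression [1, 2, 3] in three patients with times [1, 2, 3], the last being censored, gives r = −1.5: the patients who die earliest have the lowest expression relative to those still at risk.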

[0097] SAM can be adapted for still other types of experimental data. For example, to identify genes whose expression correlates with a quantitative parameter, such as tumor stage, d(i) can be defined in terms of the Pearson correlation coefficient, as described in more detail in the example below.

[0098] One example of a method for identifying genes whose expression correlates with a continuous parameter is a method that identifies genes whose expression in a tumor correlates with the survival time of the patient with the tumor.

[0099] Let x_(k)(i) be the expression of gene (i) in sample (k) (e.g., the kth tumor). Define x (i) to be the average expression of gene (i) over all the samples.

[0100] Let y_(k) be the value for the continuous parameter (e.g., time) associated with sample (k). Define y to be the average of the continuous parameter over all the samples.

[0101] The Pearson correlation coefficient, r(i) for gene (i) is:

$r(i) = \sum_{k} \left[ x_{k}(i) - \bar{x}(i) \right] \left[ y_{k} - \bar{y} \right] / \left[ \sum_{k} \left[ x_{k}(i) - \bar{x}(i) \right]^{2} \sum_{k} (y_{k} - \bar{y})^{2} \right]^{1/2}$   (8)

[0102] The values for r(i) lie between −1 and +1. For r(i)≈+1, the correlation is strongly positive. For r(i)≈−1, the correlation is strongly negative. An example of a modified Pearson correlation coefficient that could serve as the parameter d(i) is:

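$d(i) = \sum_{k} \left[ x_{k}(i) - \bar{x}(i) \right] \left[ y_{k} - \bar{y} \right] / \left\{ \left[ \sum_{k} \left[ x_{k}(i) - \bar{x}(i) \right]^{2} \sum_{k} (y_{k} - \bar{y})^{2} \right]^{1/2} + s_{0} \right\}$   (9)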

[0103] The value of s₀ would be adjusted in the manner described above, thus permitting comparison across the entire set of genes. To compute the expected d(i), the survival times are permuted among the tumors.
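The modified Pearson coefficient of equation (9) may be sketched as follows, with s₀ passed in as a number that has already been chosen. The function name and array layout are assumptions made for illustration.

```python
import numpy as np

def modified_pearson_d(x, y, s0):
    """x: expression array of shape (n_genes, n_samples).
    y: continuous parameter (e.g., time), one value per sample.
    s0: constant added to the denominator.
    Returns d(i) for each gene."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean(axis=1, keepdims=True)   # x_k(i) minus the gene average
    yc = y - y.mean()                        # y_k minus the overall average
    numerator = xc @ yc
    denominator = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return numerator / (denominator + s0)
```

With s₀ = 0 the ordinary Pearson coefficient is recovered, so a gene with expression [1, 2, 3] against y = [1, 2, 3] gives d = 1. With s₀ = 2 the same gene gives d = 2/(2+2) = 0.5. Expected values would be obtained by applying the same function after shuffling y among the samples.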

[0104] In addition to applications using the Pearson correlation coefficient, another example includes the definition of d(i) for paired data, such as gene expression in tumors before and after chemotherapy. In each case, the FDR is estimated by random permutation of the data for gene expression among the different experimental arms, i.e., permutations among the n arms of y_(j).

[0105] Weighting Function to Improve Data Reproducibility

[0106] For microarrays that contain several probes for each gene, expression is typically computed as a simple mean or a trimmed mean, which eliminates contributions from probes that diverge from the mean by a predetermined cutoff. Such methods fail to eliminate uncertainty in the data arising from probes that do not behave appropriately, as shown in FIG. 5A.

[0107] Data reproducibility can be improved by modifying the contribution of each probe by a continuous weighting function. For example, the weight for probe (i) of a given gene can be determined by a Gaussian weight function,

$w(i) = \exp\left[ -(x_{i} - x_{0})^{2} / a^{2} \right]$   (10)

[0108] where x₀=mean or median of data from all the probes for the gene, and a=constant multiplied by standard deviation or median absolute deviation of data. When the Gaussian weight function was applied to an experiment in which the same sample was hybridized twice to two microarrays, there was major improvement in the data (FIG. 5B). The scatter function decreased by more than a factor of two and the number of negatively expressed genes decreased from 25% to 0.001%.
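One way the Gaussian weight function might be applied is sketched below. The choices made here are assumptions for illustration only: x₀ is taken as the median of the probe data, a as a constant c times the standard deviation, and the gene's expression as the weighted mean of the probe values. The function name is hypothetical.

```python
import numpy as np

def gaussian_weighted_expression(probes, c=1.0):
    """probes: data from all probes for one gene.
    c: constant multiplying the standard deviation to give a.
    Returns a weighted-mean expression value for the gene."""
    probes = np.asarray(probes, dtype=float)
    x0 = np.median(probes)                 # center of the weight function
    a = c * np.std(probes)                 # width of the weight function
    if a == 0:
        return float(x0)                   # all probes agree
    w = np.exp(-((probes - x0) ** 2) / a ** 2)
    return float((w * probes).sum() / w.sum())
```

Because the weight falls off continuously with distance from x₀, a single aberrant probe is suppressed without a hard cutoff. For probe values [10, 11, 9, 100], the weighted value stays close to 10, whereas the simple mean is 32.5.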

[0109] Thus, SAM is a robust and straightforward method that can be adapted to a broad range of experimental situations. SAM and its modifications are available for use at http://www-stat-class.stanford.edu/SAM/SAMServlet. This web site is used at Stanford University.

[0110] Software Implementation

[0111] The invention has been described above, employing methods and producing plots as illustrated in the Figures. Such methods and graphs or plots may be produced with the aid of machines such as computers. Therefore, another aspect of the invention involves the software components that are loaded to a computer to perform the above-described functions. These functions provide results with the different advantages outlined above. The software or program components may be installed in a computer in a variety of ways.

[0112] As will be understood in the art, the inventive software components may be embodied in a fixed media program component containing logic instructions and/or data that, when loaded into an appropriately configured computing device, cause that device to perform according to the invention. As will be understood in the art, a fixed media program may be delivered to a user on a fixed media for loading in a user's computer, or a fixed media program can reside on a remote server that a user accesses through a communication medium in order to download a program component. Thus another aspect of the invention involves transmitting, or causing to be transmitted, the program component to a user where the component, when downloaded into the user's device, can perform any one or more of the functions described above.

[0113] FIG. 6 shows an information appliance (or digital device) 40 that may be understood as a logical apparatus that can read instructions from media 47 and/or network port 49. Apparatus 40 can thereafter use those instructions to direct server or client logic, as understood in the art, to embody aspects of the invention. One type of logical apparatus that may embody the invention is a computer system as illustrated in 40, containing CPU 44, optional input devices 49 and 41, disk drives 45 and optional monitor 46. Fixed media 47 may be used to program such a system and may represent a disk-type optical or magnetic media, magnetic tape, solid state memory, etc. One or more aspects of the invention may be embodied in whole or in part as software recorded on this fixed media. Communication port 49 may also be used to initially receive instructions that are used to program such a system to perform any one or more of the above-described functions and may represent any type of communication connection, such as to the internet or any other computer network. The instructions or program may be transmitted directly to a user's device or be placed on a network, such as a website of the internet, to be accessible through a user's device. All such methods of making the program or software component available to users are known to those in the art and will not be described here.

[0114] The invention also may be embodied in whole or in part within the circuitry of an application specific integrated circuit (ASIC) or a programmable logic device (PLD). In such a case, the invention may be embodied in a computer understandable descriptor language which may be used to create an ASIC or PLD that operates as herein described.

[0115] While the invention has been described above by reference to various embodiments, it will be understood that changes and modifications may be made without departing from the scope of the invention, which is to be defined only by the appended claims and their equivalents. All references referred to herein are incorporated by reference in their entireties.

What is claimed is:
 1. A method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises: providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets; adjusting the parameters of the plurality of genes so that the parameters are substantially independent of scatter values or average associated values of the genes over the sets; deriving an observed value and an expected value of the adjusted parameter for each gene from the sets of associated values; and comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance among the sets.
 2. The method of claim 1, wherein said adjusting includes: dividing the scatter values or average associated values of the genes into subsets each having a similar range of values, and calculating the standard deviation of each of the parameters within each subset; altering the parameters until a coefficient of variation of the standard deviations of the parameters among the subsets is minimized.
 3. The method of claim 1, further comprising obtaining said sets of associated values from multiple measurements of the plurality of genes, or values derived therefrom.
 4. The method of claim 1, wherein said sets of associated values represent gene expression or number of gene copies or levels of protein encoded by the genes.
 5. The method of claim 1, wherein said sets of associated values include calculated or predicted values.
 6. The method of claim 1, wherein said providing includes calculating a difference value between an associated value of each gene in a first of the sets or a value derived therefrom and an associated value of that gene in a second of the sets or a value derived therefrom; wherein the parameter is a function of the difference value of that gene.
 7. The method of claim 6, wherein said providing further includes: generating for each of the plurality of genes a scatter value that quantifies variation in the associated values of that gene within the first and second sets; and wherein said parameter is a function of the scatter value and of the difference value, said parameter defining a relative difference value of that gene.
 8. The method of claim 7, wherein said generating employs the following equation: s(i)=({1/a}{Σ _(m) [x _(m)(i)− x _(I)(i)]²+Σ_(n) [x _(n)(i)− x _(U)(i)]²})^(1/2) where gene (i) has associated values x_(I)(i) and x_(U)(i) in Ith and Uth states respectively in the first and second sets of associated values, I and U being positive integers; Σ_(m) and Σ_(n) are sums over associated values of gene (i) in states I in the first set and in states U in the second set respectively, where s(i) is the scatter value of gene (i), and a is a constant.
 9. The method of claim 8, wherein said calculating calculates the parameter d(i) from the following equation: d(i)=[ x _(I)(i)− x _(U)(i)]/[s(i)+s ₀] where s₀ is a constant, and x _(I)(i) and x _(U)(i) are the average values of x_(I)(i) and x_(U)(i) respectively in the first and second sets of associated values.
 10. The method of claim 9, further comprising: dividing the scatter values or average associated values of the genes into subsets each having a similar range of values, and calculating the standard deviation of each of the parameters within each subset; and altering value of s₀ until a coefficient of variation of the standard deviations of the parameters among the subsets is minimized.
 11. The method of claim 1, wherein said associated values of the genes are correlated with another variable so that each of said associated values has a corresponding value of the variable, and wherein the parameter is provided using a Pearson correlation coefficient related to a weighted difference between each of the associated values and an average associated value, the variance of the associated values and the variance of the variable, said difference weighted by deviation of the corresponding value of the variable of such associated value from its average value.
 12. The method of claim 11, wherein said variable is continuous.
 13. The method of claim 12, wherein said variable is time.
 14. The method of claim 11, wherein the parameter is selected using the Pearson correlation coefficient and a quantity s₀ that has a value adjusted as follows: dividing the scatter values or average associated values of the genes into subsets each having a similar range of values, and calculating the standard deviation of each of the parameters within each subset; and altering value of s₀ until a coefficient of variation of the standard deviations of the parameters among the subsets is minimized.
 15. The method of claim 11, the number of sets of associated values being k, k being a positive integer, wherein said Pearson correlation coefficient r(i) is given by: ${r(i)} = {\sum_{k}{{\left\lbrack \left( {{x_{k}(i)} - {\underset{\_}{x}(i)}} \right) \right\rbrack \left\lbrack \left( {y_{k} - \underset{\_}{y}} \right) \right\rbrack}/\sqrt{\sum_{k}{\left( {{x_{k}(i)} - {\underset{\_}{x}(i)}} \right)^{2}{\sum_{k}\left( {y_{k} - \underset{\_}{y}} \right)^{2}}}}}}$

where x_(k)(i) is the associated value of gene (i) in the kth set of associated values, x(i) the average of the associated values of gene (i) in all the sets, y_(k) the value of the variable corresponding to x_(k)(i), y the average value of y_(k) in all the sets, and Σ_(k) is a sum over all values of k.
 16. The method of claim 1, wherein the associated values in each set are classified into two or more subsets with values in each subset having a correlation with one another, and wherein the parameter is selected using a quantity related to variances between the associated values in the subsets of the sets and the variances of the associated values within each subset of the sets.
 17. The method of claim 16, wherein the quantity relates to the sum of variances between the associated values in the subsets of the sets and the sum of variances of the associated values within each subset of the sets.
 18. The method of claim 17, wherein the parameter is selected using the Fisher discriminant and a quantity s₀ having a value which has been adjusted as follows: dividing the scatter values or average associated values of the genes into subsets each having a similar range of values, and calculating the standard deviation of each of the parameters within each subset; and altering value of s₀ until a coefficient of variation of the standard deviations of the parameters among the subsets is minimized.
 19. The method of claim 18, wherein the number of subsets of associated values of such set being k, k being a positive integer, and the Fisher discriminant F(i) is given by: F(i)=Σ_(k) n _(k) [x _(k)(i)− x (i)]²/Σ_(k)Σ_(j) [x _(j)(i)− x _(k)(i)]² where x_(k)(i) is an associated value of gene (i) in the kth subset of associated values, x_(k) (i) the average of the associated values of gene (i) in the kth subset, x(i) the average value of the associated values of gene (i) in all of the subsets, n_(k) the number of associated values in the kth set, Σ_(j) a sum over all the associated values of gene (i) in the kth subset, and Σ_(k) a sum of the associated values of gene (i) over all of the subsets.
 20. The method of claim 1, the sets of associated values referred to as original sets, wherein said deriving includes deriving said expected value by: permuting, for each of the plurality of genes, the associated values for such gene in the original sets to arrive at a number of different permutations; classifying the associated values in each permutation of each gene into corresponding permuted sets that are different from the original sets; and supplying for each permutation a parameter value of each of the genes derived from an associated value of such gene in each of the corresponding permuted sets for such permutation or values derived therefrom.
 21. The method of claim 20, wherein said associated values of the genes are correlated with another variable so that each of said associated values has an associated value of the variable, wherein the permuting permutes the associated values so that at least each of some of the associated values has a different associated variable.
 22. The method of claim 21, wherein the associated values are classified into two or more subsets with values in each subset having a correlation with one another, wherein the permuting permutes the associated values so that at least each of some of the associated values is in a subset different from the subset it is classified into.
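The permuting recited in claims 20 to 22 might be sketched, under stated assumptions, as a reshuffling of the labels attached to the samples. The labels may be class names or values of the correlated variable. The helper name `permute_labels` and the use of a seeded pseudo-random generator are illustrative choices, not part of the claims.

```python
import random


def permute_labels(labels, n_permutations, seed=0):
    """Return n_permutations shuffled copies of labels, each differing from the original.

    Shuffling the labels reassigns at least some associated values to a
    different subset (or a different value of the correlated variable).
    """
    original = list(labels)
    if len(set(original)) < 2:
        raise ValueError("need at least two distinct labels to permute")
    rng = random.Random(seed)
    permutations = []
    while len(permutations) < n_permutations:
        shuffled = original[:]
        rng.shuffle(shuffled)
        if shuffled != original:  # keep only arrangements that differ from the original sets
            permutations.append(shuffled)
    return permutations
```

The parameter of each gene would then be recomputed once per returned labeling, using the same formula as for the original sets.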
 23. A method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein the associated values correlate with patient survival time, and wherein the associated values of the genes are obtained from a number of data sources, said method comprising: defining pairs of death and risk sets, each pair having a corresponding patient death time, where the death set of such pair includes associated values corresponding to the death time of such pair and the risk set of such pair includes associated values corresponding to times occurring after the death time of such pair; providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets; deriving an observed value and an expected value of the parameter for each gene from the sets of associated values; and comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance.
 24. The method of claim 23, wherein said providing provides said parameter as a function of weighted differences between the average associated values of the death and risk sets of the pairs, and of weighted variances within the risk sets.
 25. The method of claim 24, wherein said providing provides for gene (i) said parameter by means of r(i) and s(i) given by the following: $r(i) = \sum_k d_k [\bar{x}_k^*(i) - \bar{x}_k(i)]$ and $s(i) = \{\sum_k (d_k/m_k) \sum_{j \in R(k)} [x_j(i) - \bar{x}_k(i)]^2\}^{1/2}$, where there are K unique death times $z_1, z_2, \ldots, z_K$; D(k), for k=1, . . . , K, are death sets defined by $D(k) = \{j : t_j = z_k\}$; R(k) are risk sets defined by $R(k) = \{j : t_j \ge z_k\}$; $m_k$ is the number of patients in R(k); $d_k$ is the number of patient deaths at time $z_k$; an average expression of gene (i) in death set D(k) is given by: $\bar{x}_k^*(i) = \sum_{j \in D(k)} x_j(i)/d_k$; and an average expression of gene (i) in risk set R(k) is given by: $\bar{x}_k(i) = \sum_{j \in R(k)} x_j(i)/m_k$.
 26. The method of claim 24, wherein said providing provides said parameter by means of r(i) and s(i) given by the following: r(i)/[s(i)+s₀], where s₀ is a constant.
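For illustration, the quantities r(i), s(i) and r(i)/[s(i)+s₀] of claims 25 and 26 might be computed for a single gene as in the sketch below. The function name `survival_score` and the `died` flag, which lets a patient be followed without an observed death, are assumptions of the example; the claims define the death and risk sets only through the times themselves.

```python
import math


def survival_score(x, times, died, s0=0.0):
    """Return (r, s, r / (s + s0)) for one gene.

    x[j]: expression of the gene in patient j; times[j]: that patient's time;
    died[j]: True if a death was observed at times[j] (hypothetical flag).
    Raises ZeroDivisionError if s + s0 is zero.
    """
    patients = range(len(x))
    death_times = sorted({times[j] for j in patients if died[j]})
    r = 0.0
    variance = 0.0
    for z in death_times:
        death_set = [j for j in patients if died[j] and times[j] == z]
        risk_set = [j for j in patients if times[j] >= z]
        d_k, m_k = len(death_set), len(risk_set)
        mean_death = sum(x[j] for j in death_set) / d_k
        mean_risk = sum(x[j] for j in risk_set) / m_k
        # Weighted difference between death-set and risk-set averages.
        r += d_k * (mean_death - mean_risk)
        # Weighted scatter within the risk set.
        variance += (d_k / m_k) * sum((x[j] - mean_risk) ** 2 for j in risk_set)
    s = math.sqrt(variance)
    return r, s, r / (s + s0)
```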
 27. The method of claim 24, further comprising: dividing the scatter values or average associated values of the genes into subsets each having a similar range of values, and calculating the standard deviation of each of the parameters within each subset; and altering the value of s₀ until a coefficient of variation of the standard deviations of the parameters among the subsets is minimized.
 28. A method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises: calculating for each gene a value for a statistical parameter indicating differences between associated values of such gene among the original sets; ranking the values of the parameter of the genes; providing an expected value of such parameter for each rank, wherein said providing includes permuting the associated values in the original sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; and comparing the calculated and expected values for the parameter of the same rank to identify genes whose associated values differ by an amount of statistical significance among the sets.
 29. The method of claim 28, wherein said providing comprises: for each permutation, deriving a value of the parameter for each gene and ranking the genes by their associated parameter values; and determining the expected value of such parameter for each rank by computing an average value of the parameter of all the permutations having such rank.
 30. The method of claim 29, wherein said comparing comprises identifying a gene as one whose associated values differ by an amount of statistical significance among the sets when the difference for such gene between the calculated value of the parameter of a rank and the expected value of such parameter of the same rank exceeds a threshold.
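The ranking, averaging and comparing steps of claims 28 to 30 might be sketched as follows. The function names, the ascending rank order, and the use of an absolute difference against the threshold are assumptions of the sketch rather than requirements stated in the claims.

```python
def expected_by_rank(permuted_scores):
    """Average, over permutations, of the parameter value found at each rank.

    permuted_scores[p][i] is the parameter value of gene i under permutation p.
    """
    sorted_runs = [sorted(run) for run in permuted_scores]
    n_perm = len(sorted_runs)
    n_genes = len(sorted_runs[0])
    return [sum(run[rank] for run in sorted_runs) / n_perm for rank in range(n_genes)]


def call_significant(observed, expected, threshold):
    """Return indices of genes whose observed value differs from the expected
    value of the same rank by more than threshold."""
    order = sorted(range(len(observed)), key=lambda i: observed[i])
    return [gene for rank, gene in enumerate(order)
            if abs(observed[gene] - expected[rank]) > threshold]
```

Here `observed[i]` is the value calculated for gene i from the original sets, and `expected` is the list returned by `expected_by_rank`.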
 31. The method of claim 29, wherein said method further comprises identifying a lowest rank gene whose parameter value derived for a permutation is positive and exceeds a first threshold, setting such parameter value as a second threshold, comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value exceeds the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
 32. The method of claim 29, wherein said method further comprises identifying a lowest rank gene whose parameter value derived for a permutation is negative and less than a first threshold, setting such parameter value as a second threshold, comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value is less than the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
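One possible reading of claims 31 and 32 is sketched below: the least extreme positive and negative values among the genes already called significant serve as the second thresholds, and permuted values beyond them are counted as false calls. This interpretation, the function names, and the averaging of counts over permutations are assumptions of the sketch.

```python
def cutoffs_from_calls(observed, called):
    """Return (upper, lower) second thresholds from the called genes.

    upper: smallest positive observed value among called genes (inf if none).
    lower: negative observed value closest to zero among called genes (-inf if none).
    """
    positive = [observed[i] for i in called if observed[i] > 0]
    negative = [observed[i] for i in called if observed[i] < 0]
    upper = min(positive) if positive else float("inf")
    lower = max(negative) if negative else float("-inf")
    return upper, lower


def estimate_false_calls(permuted_scores, upper, lower):
    """Average number of permuted values per permutation lying beyond either threshold."""
    counts = [sum(1 for d in run if d > upper or d < lower) for run in permuted_scores]
    return sum(counts) / len(counts)
```

Dividing the returned average by the number of genes called significant gives an estimate of the fraction of such genes identified by chance.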
 33. The method of claim 28, wherein the sets of associated values in each permutation contain approximately an equal number of associated values from each of the original sets of associated values.
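A sketch of such balanced permutations for two original sets of equal size is given below. The restriction to two equal-sized sets and the name `balanced_permutations` are assumptions made to keep the example short.

```python
from itertools import combinations


def balanced_permutations(group_a, group_b):
    """Yield (new_a, new_b) pairs in which each permuted set draws about half
    of its members from each of the two original sets."""
    n = len(group_a)
    if n != len(group_b):
        raise ValueError("this sketch assumes two original sets of equal size")
    k = n // 2
    for keep_a in combinations(range(n), k):
        for take_b in combinations(range(n), n - k):
            # new_a keeps k members of A and receives n - k members of B.
            new_a = [group_a[i] for i in keep_a] + [group_b[i] for i in take_b]
            # new_b receives everything not placed in new_a.
            new_b = ([group_a[i] for i in range(n) if i not in keep_a]
                     + [group_b[i] for i in range(n) if i not in take_b])
            yield new_a, new_b
```

With two sets of two samples each, the generator produces four permuted pairs, each set holding one sample from each original set.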
 34. A method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values are falsely identified to differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises: defining for each gene a statistical parameter indicating differences between associated values of such gene among the original sets; providing an expected value of such parameter for each gene, wherein said providing includes permuting the associated values in the sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; deriving for each gene a value for the parameter for each permutation and ranking the genes by their derived parameter values; finding a lowest rank gene whose derived parameter value extends beyond a first threshold; setting such derived parameter value as a second threshold; and comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value extends beyond the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
 35. A method for reducing statistical error of a set of associated values of genes, wherein the method comprises: providing a set of associated values of each gene; and processing said set of associated values of that gene using a smooth weighting function to yield a representative value for that gene.
 36. The method of claim 35, wherein said processing uses a Gaussian weighting function.
 37. A method for comparing sets of associated values of genes, which comprises: providing sets of associated values of each gene; processing said sets of associated values of that gene using a smooth weighting function to obtain a representative value for that gene from each of the sets; and comparing representative values for that gene for the sets.
 38. The method of claim 37, wherein said providing includes calculating a difference PM-MM of a probe pair of a microarray.
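The claims do not fix the form of the smooth weighting function. Purely as an assumed example, the sketch below weights each PM−MM difference by a Gaussian centered on the median of the differences, so that outlying probe pairs contribute little to the representative value. The function name `gaussian_weighted_value`, the choice of the median as center, and the `width` argument are all hypothetical.

```python
import math
import statistics


def gaussian_weighted_value(values, width):
    """Weighted mean of values with Gaussian weights centered on their median.

    width sets how quickly the weight falls off with distance from the median.
    """
    center = statistics.median(values)
    weights = [math.exp(-0.5 * ((v - center) / width) ** 2) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def representative_value(probe_pairs, width):
    """probe_pairs: list of (PM, MM) intensities for one gene."""
    differences = [pm - mm for pm, mm in probe_pairs]
    return gaussian_weighted_value(differences, width)
```

Unlike a hard cutoff that discards outliers, a smooth weight changes continuously as a value moves away from the center.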
 39. A method for comparing a first and a second set of associated values of genes, which comprises: providing odd root values of the values in the first set, and odd root values of the values in the second set; and comparing the odd root values of the values in the first set and the odd root values of the values in the second set.
 40. The method of claim 39, wherein said providing provides the cube or fifth root values of the values in the first or second sets.
 41. The method of claim 40, wherein said representing includes scaling the odd root values along the two axes, and wherein said method further comprises providing a best fit curve for the odd root values of the first and second set in the plot.
 42. The method of claim 39, wherein said comparing includes representing the odd root values of the values in the first set along a first axis of a two-dimensional plot and the odd root values of the values in the second set along a second axis of the plot.
 43. The method of claim 39, wherein said odd root values provided and compared include values derived from positive and negative associated values.
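An odd root, unlike a logarithm or square root, is defined for negative as well as positive values. A minimal sketch of a sign-preserving odd root follows; the function names `odd_root` and `paired_odd_roots` are illustrative only.

```python
import math


def odd_root(value, n=3):
    """Return the real nth root of value for odd n, keeping the sign of value."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    # Take the root of the magnitude, then restore the original sign.
    return math.copysign(abs(value) ** (1.0 / n), value)


def paired_odd_roots(first_set, second_set, n=3):
    """Return (x, y) points for a two-dimensional plot of the two sets."""
    return [(odd_root(a, n), odd_root(b, n)) for a, b in zip(first_set, second_set)]
```

The sign is handled explicitly because raising a negative float to a fractional power in Python does not return the real root.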
 44. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises: providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets; adjusting the parameters of the plurality of genes so that the parameters are substantially independent of scatter values or average associated values of the genes over the sets; deriving an observed value and an expected value of the adjusted parameter for each gene from the sets of associated values; and comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance among the sets.
 45. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein the associated values correlate with patient survival time, and wherein the associated values of the genes are obtained from a number of data sources, said method comprising: defining pairs of death and risk sets, each pair having a corresponding patient death time, where the death set of such pair includes associated values corresponding to the death time of such pair and the risk set of such pair includes associated values corresponding to times occurring after the death time of such pair; providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets; deriving an observed value and an expected value of the parameter for each gene from the sets of associated values; and comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance.
 46. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises: calculating for each gene a value for a statistical parameter indicating differences between associated values of such gene among the original sets; ranking the values of the parameter of the genes; providing an expected value of such parameter for each rank, wherein said providing includes permuting the associated values in the original sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; and comparing the calculated and expected values for the parameter of the same rank to identify genes whose associated values differ by an amount of statistical significance among the sets.
 47. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values are falsely identified to differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises: defining for each gene a statistical parameter indicating differences between associated values of such gene among the original sets; providing an expected value of such parameter for each gene, wherein said providing includes permuting the associated values in the sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; deriving for each gene a value for the parameter for each permutation and ranking the genes by their derived parameter values; finding a lowest rank gene whose derived parameter value extends beyond a first threshold; setting such derived parameter value as a second threshold; and comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value extends beyond the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
 48. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for reducing statistical error of a set of associated values of genes, wherein the method comprises: providing a set of associated values of each gene; and processing said set of associated values of that gene using a smooth weighting function to yield a representative value for that gene.
 49. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for comparing sets of associated values of genes, which comprises: providing sets of associated values of each gene; processing said sets of associated values of that gene using a smooth weighting function to obtain a representative value for that gene from each of the sets; and comparing representative values for that gene for the sets.
 50. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for comparing a first and a second set of associated values of genes, which comprises: providing odd root values of the values in the first set, and odd root values of the values in the second set; and comparing the odd root values of the values in the first set and the odd root values of the values in the second set.
 51. A method for transmitting a program of instructions executable by a computer to perform a method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises: causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process: providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets; adjusting the parameters of the plurality of genes so that the parameters are substantially independent of scatter values or average associated values of the genes over the sets; deriving an observed value and an expected value of the adjusted parameter for each gene from the sets of associated values; and comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance among the sets.
 52. A method for transmitting a program of instructions executable by a computer to perform a method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein the associated values correlate with patient survival time, and wherein the associated values of the genes are obtained from a number of data sources, said method comprising: causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process: defining pairs of death and risk sets, each pair having a corresponding patient death time, where the death set of such pair includes associated values corresponding to the death time of such pair and the risk set of such pair includes associated values corresponding to times occurring after the death time of such pair; providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets; deriving an observed value and an expected value of the parameter for each gene from the sets of associated values; and comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance.
 53. A method for transmitting a program of instructions executable by a computer to perform a method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises: causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process: calculating for each gene a value for a statistical parameter indicating differences between associated values of such gene among the original sets; ranking the values of the parameter of the genes; providing an expected value of such parameter for each rank, wherein said providing includes permuting the associated values in the original sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; and comparing the calculated and expected values for the parameter of the same rank to identify genes whose associated values differ by an amount of statistical significance among the sets.
 54. A method for transmitting a program of instructions executable by a computer to perform a method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values are falsely identified to differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises: causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process: defining for each gene a statistical parameter indicating differences between associated values of such gene among the original sets; providing an expected value of such parameter for each gene, wherein said providing includes permuting the associated values in the sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; deriving for each gene a value for the parameter for each permutation and ranking the genes by their derived parameter values; finding a lowest rank gene whose derived parameter value extends beyond a first threshold; setting such derived parameter value as a second threshold; and comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value extends beyond the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
 55. A method for transmitting a program of instructions executable by a computer to perform a method for reducing statistical error of a set of associated values of genes, wherein the method comprises: causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process: providing a set of associated values of each gene; and processing said set of associated values of that gene using a smooth weighting function to yield a representative value for that gene.
 56. A method for transmitting a program of instructions executable by a computer to perform a method for comparing sets of associated values of genes, which comprises: causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process: providing sets of associated values of each gene; processing said sets of associated values of that gene using a smooth weighting function to obtain a representative value for that gene from each of the sets; and comparing representative values for that gene for the sets.
 57. A method for transmitting a program of instructions executable by a computer to perform a method for comparing a first and a second set of associated values of genes, which comprises: causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process: providing odd root values of the values in the first set, and odd root values of the values in the second set; and comparing the odd root values of the values in the first set and the odd root values of the values in the second set.
 58. A computer system for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the system comprises: one or more computers; one or more computer programs running on the computer(s), performing the following: providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets; adjusting the parameters of the plurality of genes so that the parameters are substantially independent of scatter values or average associated values of the genes over the sets; deriving an observed value and an expected value of the adjusted parameter for each gene from the sets of associated values; and comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance among the sets.
 59. A computer system for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein the associated values correlate with patient survival time, and wherein the associated values of the genes are obtained from a number of data sources, said system comprising: one or more computers; one or more computer programs running on the computer(s), performing the following: defining pairs of death and risk sets, each pair having a corresponding patient death time, where the death set of such pair includes associated values corresponding to the death time of such pair and the risk set of such pair includes associated values corresponding to times occurring after the death time of such pair; providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets; deriving an observed value and an expected value of the parameter for each gene from the sets of associated values; and comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance.
 60. A computer system for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the system comprises: one or more computers; one or more computer programs running on the computer(s), performing the following: calculating for each gene a value for a statistical parameter indicating differences between associated values of such gene among the original sets; ranking the values of the parameter of the genes; providing an expected value of such parameter for each rank, wherein said providing includes permuting the associated values in the original sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; and comparing the calculated and expected values for the parameter of the same rank to identify genes whose associated values differ by an amount of statistical significance among the sets.
 61. A computer system for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values are falsely identified to differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the system comprises: one or more computers; one or more computer programs running on the computer(s), performing the following: defining for each gene a statistical parameter indicating differences between associated values of such gene among the original sets; providing an expected value of such parameter for each gene, wherein said providing includes permuting the associated values in the sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; deriving for each gene a value for the parameter for each permutation and ranking the genes by their derived parameter values; finding a lowest rank gene whose derived parameter value extends beyond a first threshold; setting such derived parameter value as a second threshold; and comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value extends beyond the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
 62. A computer system for reducing statistical error of a set of associated values of genes, wherein the system comprises: one or more computers; one or more computer programs running on the computer(s), performing the following: providing a set of associated values of each gene; and processing said set of associated values of that gene using a smooth weighting function to yield a representative value for that gene.
 63. A computer system for comparing sets of associated values of genes, which comprises: one or more computers; one or more computer programs running on the computer(s), performing the following: providing sets of associated values of each gene; processing said sets of associated values of that gene using a smooth weighting function to obtain a representative value for that gene from each of the sets; and comparing representative values for that gene for the sets.
 64. A computer system for comparing a first and a second set of associated values of genes, comprising: one or more computers; one or more computer programs running on the computer(s), performing the following: providing odd root values of the values in the first set, and odd root values of the values in the second set; and comparing the odd root values of the values in the first set and the odd root values of the values in the second set.